Sampling zoo integration #1423
Code in this pull request contains PEP8 errors, please write the |
if strategy_kind == 'chunking':
    return self._sample_chunking(
        factory=factory,
        features=features,
        target=target,
        strategy=strategy,
        strategy_params=strategy_params,
        random_state=random_state
    )
elif strategy_kind == 'subset':
    return self._sample_subset(
        factory=factory,
        features=features,
        target=target,
        strategy=strategy,
        strategy_params=strategy_params,
        random_state=random_state,
        injectable_params=injectable_params
    )
else:
    raise ValueError(f'Unsupported sampling strategy kind: {strategy_kind}')
Rewrite all branches like this as membership checks against an enumerated type or a mapping (dictionary), so the code stays extensible:
strategy_kind in available_strategies
or, in this case:
return available_sample_methods[strategy_kind]
With the approach used in the current implementation, adding new sampling strategies later will be awkward.
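A minimal sketch of the mapping-based dispatch suggested in this comment. The handler functions, `SAMPLE_METHODS`, and `dispatch_sampling` are hypothetical stand-ins for illustration, not code from this PR; the real `_sample_chunking` and `_sample_subset` take the arguments shown in the diff above.

```python
from typing import Any, Callable, Dict


def sample_chunking(**kwargs: Any) -> str:
    # Hypothetical stand-in for SamplingProvider._sample_chunking.
    return 'chunking'


def sample_subset(**kwargs: Any) -> str:
    # Hypothetical stand-in for SamplingProvider._sample_subset.
    return 'subset'


# Registering a new strategy kind means adding one entry here,
# instead of extending an if/elif chain.
SAMPLE_METHODS: Dict[str, Callable[..., str]] = {
    'chunking': sample_chunking,
    'subset': sample_subset,
}


def dispatch_sampling(strategy_kind: str, **kwargs: Any) -> str:
    """Look up the handler for strategy_kind and call it with kwargs."""
    try:
        handler = SAMPLE_METHODS[strategy_kind]
    except KeyError:
        # Same error message as the original else-branch.
        raise ValueError(
            f'Unsupported sampling strategy kind: {strategy_kind}'
        ) from None
    return handler(**kwargs)
```

The membership check `strategy_kind in SAMPLE_METHODS` can also be used for validation before any work starts.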
def _sample_subset(self,
                   factory: Any,
                   features: np.ndarray,
                   target: np.ndarray,
                   strategy: str,
                   strategy_params: Dict[str, Any],
                   random_state: Optional[int],
                   injectable_params: Optional[Dict[str, Any]]) -> SamplingProviderResult:
    n_rows = int(features.shape[0])
Move this into pure functions outside SamplingProvider.
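One possible shape for such a pure function, as a sketch only. `select_subset_indices` and its parameters are hypothetical names, and uniform sampling without replacement is an assumption standing in for whatever the configured strategy actually does. The function depends only on its arguments, so it can be tested without constructing a SamplingProvider.

```python
import random
from typing import List, Optional


def select_subset_indices(n_rows: int,
                          n_samples: int,
                          random_state: Optional[int] = None) -> List[int]:
    """Return sorted row indices for a uniform subset without replacement.

    Hypothetical helper: no instance state is read or modified, and a fixed
    random_state gives a reproducible result.
    """
    if not 0 < n_samples <= n_rows:
        raise ValueError(
            f'n_samples must be in [1, {n_rows}], got {n_samples}'
        )
    rng = random.Random(random_state)  # local generator, no global state
    return sorted(rng.sample(range(n_rows), n_samples))
```

The provider method would then only compute `n_rows` from `features.shape[0]`, call the helper, and index `features` and `target` with the result.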
def _execute_chunking(self,
                      train_data: InputData,
                      started_at: float,
                      budget_seconds: float) -> SamplingStageOutput:
    self._raise_if_budget_exceeded(started_at, budget_seconds)
    remaining_budget = self._remaining_budget(started_at, budget_seconds)
Move the _execute_* methods into pure functions; decouple the executors.
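As an illustration of the decoupling, the budget helpers called in the excerpt above could become module-level functions. This is a sketch under assumptions: the PR does not show which clock or exception type the real `_remaining_budget` and `_raise_if_budget_exceeded` use, so `time.monotonic`, `TimeoutError`, and the optional `now` parameter are hypothetical choices.

```python
import time
from typing import Optional


def remaining_budget(started_at: float,
                     budget_seconds: float,
                     now: Optional[float] = None) -> float:
    """Return the seconds left in the budget, never below zero.

    Passing `now` explicitly keeps the function deterministic in tests;
    when omitted, the monotonic clock is read (an assumption).
    """
    current = time.monotonic() if now is None else now
    return max(0.0, budget_seconds - (current - started_at))


def raise_if_budget_exceeded(started_at: float,
                             budget_seconds: float,
                             now: Optional[float] = None) -> None:
    """Raise TimeoutError once the budget is fully spent."""
    if remaining_budget(started_at, budget_seconds, now) <= 0.0:
        raise TimeoutError('Sampling stage exceeded its time budget')
```

With helpers like these, an executor function only needs `train_data`, `started_at`, and `budget_seconds` as inputs and no longer has to be a method.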
    return np.asarray(target)[indices]


@staticmethod
def _partitions_to_input_data_list(partitions: Dict[str, Any],
The _partitions_to_input_data_list method is heavily overloaded. Look at @Romankkl03's work on TensorData: study the new data-flow protocol and adapt the work in this PR to it.
_SAMPLING_MODULE_CANDIDATES = (
    'sampling_zoo.core.api.api_main',
    'sampling_zoo.api.api_main',
    'core.api.api_main',
Drop the internal dependency right away.
Summary
This PR continues the first stage of Sampling Zoo integration. Chunking and subset strategies are now explicitly separated by strategy_kind. Subset runs follow the standard single-dataset training path with sample selection, while chunking produces multiple InputData partitions and trains a PipelineEnsemble over them. The ensemble replaces the current pipeline for predict and preserves existing API behavior where possible.
Context